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ABSTRACT: The genomic data that can be collected from a single 

DNA molecule by the best chemical and optical methods (e.g., using tech- 
nologies from OpGen, BioNanoGenomics, NABSys, PacBio, etc.) are badly 
corrupted by many poorly understood noise processes. Thus, single molecule 
technology derives its utility through powerful probabilistic modeling, which 
can provide precise lower and upper bounds on various experimental parame- 
ters to create the correct map or validate sequence assembly. As an example, 
this analysis shows how as the number of "imaged" single molecules (i.e., 
coverage) is increased in the optical mapping data, the probability of suc- 
cessful computation of the map jumps from 0 to 1 for fairly small number of 
molecules. 



1 Some Preliminary Remarks 

Optical Mapping [AMS97, Ana+97b, AnMS99, AsMS99, Cai+98, Mishra03, Sam+95] 
is an approach that generates an ordered restriction map of a DNA molecule (e.g., a 
genome [Jing+99, Lai+99, Lim+01, lin+99, Zhou+02] or a clone [Cai+98, Giacalone+00, 
Sam+95, Skiadas+99]). The resulting restriction map is represented as an ordered enu- 
meration of the restriction sites along with the estimated lengths of the restriction frag- 
ments between consecutive restriction sites and various related statistics. These statistics 
accurately model the errors in estimating the restriction fragment lengths as well as the 
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Health and New York State Office of Science, Technology & Academic Research. 
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1 SOME PRELIMINARY REMARKS 2 

errors due to unrepresented and misrepresented restriction sites in the map. These physi- 
cal maps have found applications in improving the accuracy and algorithmic efficiency of 
sequence assembly, validating assembled sequences, characterizing gaps in the assembly 
and identifying candidates for finishing steps in a sequencing project. Also, because of 
its inherent simplicity and scalability as well as its reliance on single molecules, optical 
mapping also provides a fast method for moderate resolution karyotyping and haplotyp- 
ing. 

The physico-chemical approach underlying optical mapping is based on immobilizing 
long single DNA molecules on an open glass surface, digesting the molecules on the surface 
and visualizing the gaps created by restriction activities using fluorescence microscopy. 
Thus the resulting image, in the absence of any error, would produce an ordered sequence 
of restriction fragments, whose masses can be measured via relative fluorescence intensity 
and interpreted as fragment lengths in base pairs. The corrupting effects of many inde- 
pendent sources of errors affect the accuracy of an optical map created from one single 
DNA molecule, and can only be tamed by combining the optical maps of many single 
molecules covering completely or partially the same genomic region and by incorporating 
accurate statistical models of the error sources. To a rough approximation, the insur- 
mountable obstacles in the chemistry is circumvented by cleverly exploiting the statistical 
properties of the system through a "0-1 Law" in the parameter space. This law plays a 
crucial role at the heart of the entire optical mapping technology and is likely to reappear 
in other contexts as well, e.g., array-mapping, single-molecule sequencing and haplotyp- 
ing. In this paper, we focus on one such law in the context of mapping clones and we 
hint at how these results are generalized to genomic mapping. 

The main error sources limiting the accuracy of an optical map are due to either in- 
correct identification of restriction sites or incorrect estimation of the restriction fragment 
lengths. Since these error sources interact in a complex manner and involve resolution of 
the microscopy, imaging and illumination systems, surface conditions, image processing 
algorithm, digestion rate of the restriction enzyme and intensity distribution along the 
DNA molecule, statistical Bayesian approaches are used to construct a consensus map 
from large number of imperfect maps of single molecules. In the Bayesian approach, the 
main ingredients are as follows: (1) A model of the map of restriction sites (Hypothesis, 
H) and (2) A conditional probability distribution function for the single molecule map 
data given the hypothesis (Conditional pdf , f(D\H)). The conditional pdf models the 
restriction fragment sizing error in terms of a Gaussian distribution, the missing restric- 
tion site event (due to partial digestion) as a Bernoulli trial and the appearance of false 
restriction sites as a Poisson process. Using the Bayes' formula, the posterior conditional 

pdf f(H\D) = f(D\H) jTqn i s computed and provides the means for searching for the 

best hypothetical model given the set of single molecule experimental data. Since the 
underlying hypothesis space is high dimensional and the distributions are multi-modal, 
a naive computational search must be avoided. An efficient implementation involves ap- 
proximating the modes of the posterior distribution of the parameters and accurate local 
search implemented using dynamic programming [AMS97, AnMS99]. The correctness 
of the constructed map depends crucially on the choice of the experimental parameters 
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2 PROBLEM FORMULATION 3 

(e.g., sizing error, digestion rate, number of molecules). Thus, the feasibility of the entire 
method can be ensured only by a proper experimental design. 

This paper studies several simple models for optical mapping and explores their power 
and limitations when applied to the construction of maps of clones (e.g., lambdas, cosmids, 
BACs and YACs), by providing precise lower and upper bounds on the number of clone 
molecules needed to create the correct map of the clone. Our probabilistic analysis proves 
the existence of a 0-1 laws in the number of molecules. 

The paper is organized as follows: In section 2, we formulate the problem; in sections 3, 
4 and 5, we successively introduce and analyze the effects of various error sources: namely, 
partial digestion error, misorientation error and quantization error, respectively. We use 
probabilistic methods to provide upper and lower bounds on the choices of parameters 
that would ensure correct result with high probability. In section 6, we study the effect 
of sizing error and its interaction with discretization. The analysis indicates that for a 
reasonable choice of sizing error, the algorithms based on discretization are unlikely to 
work correctly with any reasonable probability. 

2 Problem Formulation 

The underlying bio-chemical problem concerns with the construction of an ordered re- 
striction map of a clone (a piece of DNA of length L, where L is measured in base-pairs 
bps, Kb = 10 3 bp or Mb = 10 6 6p). Typical values of L are 2-20Kb (lambda's), 20-45^6 
(cosmids) 150-2001T6 (BAC'S) and ^ 1Mb (YAC'S). For our mathematical analysis, we 
will often assume that L takes some fixed value which can be arbitrarily large. These 
clones are sequences of length L over the alphabet { A, T, C, G }. Certain short subse- 
quences (typically of length 6, e.g., GGATCC) can be recognized by a restriction enzyme 
(e.g., BamHI), and location of these restriction sites 

0 < Hi < H 2 < ■ ■ ■ < H k < L 

in the clone is the ordered restriction map of the clone with respect to the given enzyme. 

Let hi = Hi/L be a real number. Then the normalized ordered restriction map of the 
clone with respect to the enzyme is 

0 < hi < h 2 < ■ ■ ■ < hk < 1, 

where each hi assumes some real value in the open unit interval (0, 1). 

Note that in the absence of any additional distinguishing characteristic of the clone 
(e.g., identification of 3' end or 5' end), we could have also taken the following as another 
normalized ordered restriction map of the same clone with respect to the same enzyme: 

0 < /if < ••• < /if < /if < 1, 

where hf = 1 — hi. Note that the normalized ordered restriction map is unique up to 
reversal in the absence of any additional distinguishing characteristic, and is unique if we 
know the orientation. 
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3 False Negative Errors: Partial Digestion 

Let us postulate an experiment, where the desired normalized ordered restriction map is 
observed, subject to partial digestion error and where any particular restriction site is 
observed with some probability p < 1. We assume no other error sources for now; thus 
no other spurious sites (false restriction cuts) are included in the observation and the 
observed restriction map appears in the correct orientation. 

Thus the result of the experiment is an ordered sequence of sites (normalized) 



0 < Sl < S2 < ■ ■ ■ < Si < 1, 



where for each Sj, there is an hj in the true map, such that Si = hj. By assumption, for 
each hj the probability 

Pr [ there exists some Si s.t. hj = Sj] = p. 

Let us also assume that the experiment is repeated n times resulting in n observed 
restriction maps. Assume that the true restriction map is unknown and is to be con- 
structed from these n observations. A straight forward algorithm for doing this would 
be to simply take the union of all the observed restriction sites, and output this result in 
sorted order. 

(In k + c \ 
I then the result of the preceding algorithm is almost 
P ) 

surely correct. Here c is a constant to be determined later. Note that the probability 
that a cut site hj appears in at least one observation is 1 — (1 — p) n > 1 — e~ c /k. 
Thus the probability that all k true cut sites show up in the final map is given by 
(l-e- c /k) k >e- e ~ c + o(l). 

On the other hand, if n < I — I (k > 1 and 0 < p < 0.69) then there is a high 



probability that the amount of data is insufficient to recover the correct map. Note again 
that given a true cut site hj the probability that this cut is never observed in any of the 
n observations is simply (1 — p) n > e~ pn ^ 1+p ^ > e _lnfc = 1/k. Thus, with this value of n, 
the probability that we can recover all the true cut sites is simply bounded from above 

*H)'*:<*- 

In k + c 



In summary: Let e be a positive constant and c > ln(l/e). Then for n > 

V P 

with probability at least (1 — e), the correct ordered restriction map can be computed in 
0(nk) time. 

When n < ( — ] (k > 1 and 0 < p < 0.69), there is a probability greater than 

Vp(i +p)j 

half that no algorithm can compute the correct ordered restriction map. 

Note, however, that since the value of k and p are not known a priori, it is impossible 
to use this result in a meaningful way in designing an experiment (i.e., in choosing n). 
The algorithm itself does not use the parameters k or p; only its success probability is 
determined by these parameters for a fixed set of input data. 
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4 MISORIENTATION ERRORS 

Next, let us postulate a modified experiment, where the desired normalized ordered re- 
striction map is observed, subject to partial digestion error as well as error due to mis- 
orientation. Thus the result of the experiment is an ordered sequence of sites 

0 < s\ < s 2 < • • • < si < 1, 

where either the sequence or its reversal 

0 < *f <«£.!<■■■< sf < 1, 

could be assumed to be derived from the true normalized ordered restriction map 

0 < hi < h 2 < ■ ■ ■ < h k < 1, 
after partial digestion. By assumption, for each hj and for each observation, the proba- 



models the partial digestion. 

Assumption: For the time being, we assume that the true normalized ordered restriction 
map has no symmetric site, i.e., 



Let us also assume that the experiment is repeated n times resulting in n observed 
restriction maps whose orientations may be misspecified. 

An algorithm to reconstruct the true map may proceed in two phases: In the first 
phase, all the molecules are folded by the mid-point, thus creating "folded-maps," in 
which the orientation of the molecule is no longer an issue. The individual "folded-maps" 
are combined to compute a "consensus folded-map." In the second phase, the "consensus 
folded-map" is unfolded back to create the restriction map where each individual site 
is assigned to either left half or right half, by examining the relative locations of pairs 
of restriction sites found in the original data (assuming that enough such information is 
available) . 



bility 



Pr [ there exists some Sj s.t. hj = S{ or hj = sf] = p 



Vj Vj^j hi ^ hj . 



4.1 Phase 1: 



Define a function 



/ 




In phase 1, our goal is to construct the set 



{/(/n),/(M, •••,/(/*)} 
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4 MISORIENTATION ERRORS 6 

which can be easily accomplished by considering the sets 
{/Oil), f(s i2 ), f(sui)}, i = l,...,n. 

and proceeding in a manner similar to the one outlined in the preceding section. Using 
the arguments given earlier, we see that we will succeed in this phase with probability 

.„ . /In + c N 

e , it n > 



P 

4.2 Phase 2: 

While one cannot recreate the map directly from the result of the phase 1, one can invert 
/ correctly, if each computed site is further augmented with a sign value (G {+1,-1}), 
where +1 denotes that the site belongs to the left half [(0, 5)] and —1 denotes that the 
site belongs to the right half [(5, 1)]. Thus, we may define 

/ = (0,^)X {+1,-1} -(0,1) 



(f( h . ) b^hI f ^ if Sgn =+1; 
U(^),bgnj- I f ^ )R if Sgn = _ L 



We can assign the sign values correctly as follows: Define a graph G = (V, E), where 
V = {f(hx), f(h 2 ), . . ., f(h k )} and e = [f(h a ), f(h b )} G E if and only if 

3 ^,a,s li6 f( s i,a) = f(h a ) and f(s ijb ) = f{h b ), for some i. 

Furthermore, label e with +1 if for some i, either Si >a and Sj^ G (0, |) or Sj ja and s^;, G 
(i, 1) (6ot/t sites belong to the same half); and with —1 if for some i, either Si >a G (0, i) 
and s^b G (5, 1) or Sj jQ G 1) and Sj^ G (0, 5) (too sites belong to different halves). In 
other words, 

Sgn (e) = Sgn [(^ - s i>a )(- - s i|6 )]. 

It is trivial to see that if the graph is connected then one can compute the correct 
vertex labels by first labeling an arbitrary vertex +1 (say, f(hj)) and then labeling the 
remaining vertices by following the edge labels during a graph-search process. Thus if 
f(hi) and f(hj) are path connected by a simple path e±, e 2 , . . ., e m then 

Sgn (f(hi)) = Sgn (e x ) ■ Sgn (e 2 ) • • • Sgn (e m )Sgn (/(%)). 

Next we compute an upper bound on the number of observations necessary for G to 
be connected. Let 



n > 



2 In k + 8c 

p2 
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Let iS/jj denote the set of observations with a cut site matching f(hi). The number of 
such observations, \Sf ll \, follows a Binomial distribution ~ S(n,p). 



Pr 



S(n,p) < 



In k + c 
P 



< Pr[S(n,p) < (1 - e 0 )np], e 0 > - 

< e -np/S < g -c_ 

A cut corresponding to every f(hj) [2 < i < k] occurs in an observation in Sfa, when 
In k + c 

LS/j > , with probability 

P 

(1 _ (1 « e -*e- p|s *i l > e -e- c . 

Thus G is connected with a probability > (1 — e~ c )e~ e c as [f (hi) , f (hi)] appears in G 
for all 2 < i < k. 

Summarizing: Let e be a positive constant and c > ln(2/e). Then for 



n > max 



In k + c 2 In + 8c' 



with a probability at least (1 — e), the correct ordered restriction map can be computed 
in 0(nk 2 ) time. Also, see Appendix Al, for a slightly better bound when p = 0(l/k). 

4.3 Optical Cuts 

Next we shall consider the situation where we have additional spurious cuts (optical cuts) 
that do not correspond to any restriction sites. A sound probabilistic model for these 
spurious cuts can be given in terms of a Poisson process with parameters A/ (thus the 
expected number of false cuts per molecule is A/). Hence, for any small region [x, x + 5x] 
in an observation, 

Pr yf false cuts € [x, x + 5x] = 1] = XfSx, 
Pr [# false cuts € [x, x + Sx] > 2] = o(5x). 

The probability that an observation contains exactly / spurious cuts is given by: 
, X f f 

e f '—^r. Typical observed values for Xf are about 0.2 for Lambda clones, 0.5 for cosmids 
/! 

and 1.0 for BAC's. Thus, we expect roughly 1 false cut per 100Kb. 

Under this model, it is fairly trivial to see that the false cuts pose no serious problem. 
Our algorithm can be modified in a straight forward manner where Phase 1 computation 
needs to be somewhat more robust. 

In phase 1, our goal is to construct the set 



{f(hi)J(h 2 ),...J(h k )}. 
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This is accomplished by considering the observation-based sets 

{/Oil), f(s i2 ), f(sui)h 1 = !> • • • > n - 

and including only those /(sy)'s that occur at least twice in the combined observations. 
In other words, if there exists an i\ ^ i 2 such that if 

^Jlj2 f( S h,jl) = /( S «2,j2) = X ' 

then include x in the output set. 

/2(ln k + c) \ 

Assume that n > I I , (with c > 1.26). Then if hi is a true cut site, the 



P 

probability that f(hi) is not included in the output is 

{1 - p) n + np{l - p) 71 - 1 < (l+p(n- l))e- p(n " 1) 

< e -p(n-l)/2 



Proceeding as before the probability that all k true sites will be included is thus bounded 
from below by e~ e c , Also, by the assumption regarding the distribution of spurious cuts, 
we see that the probability that a spurious cut is included in the final set is zero. 

4.4 Symmetric Cuts 

Next, assume that the true ordered restriction map consists of k asymmetric cuts and 
m symmetric cuts. Thus the total number of cuts is k + 2m. Note that a cut hi is 
a symmetric cut, if both hi and h R are true cuts. Additionally, we assume that the 
observations are subject to the partial digestion errors, misorientation errors, spurious 
cut errors (determined by a Poisson process) and symmetric cuts. 

In this case, we proceed as before with phase 1 from the preceding subsection, and 
/ 2(ln k + c) \ 

again assuming that n > I I , we will almost surely (with probability no smaller 



p 

than e _e ) construct a set 

{f(hi),f(h 2 ), • • • , f{h k ), f(h k+1 ), f(h k+m )}. 

Note that the 2m symmetric sites yield m values in the folded structure when / is applied. 

However, before proceeding to phase 2, we will remove those f(hj)'s from the pre- 
ceding set that correspond to symmetric cuts. A simple approach we can take is to 
check each observation for the existence of symmetric cuts at positions s and s R , where 
f(s) = f(s R ) = f(h 3 ). 

(In in -I - c \ 
k I then the preceding steps correctly detect the sym- 
P 2 J 

metric cuts with probability greater than e~ e . Note that assuming hj to be a symmetric 
true cut, the probability that the above test fails in any particular observation is (1 — p 2 ) 
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5 DISCRETIZATION 9 

and thus the probability that the symmetric cut hj goes undetected in any of the n in- 
dependent observations is (1 —p 2 ) n . Thus the probability that all m symmetric true cut 
sites are detected in the final map is given by [1 — (1 — p 2 ) n ] m > e~ e 

Again, by the assumption regarding the distribution of spurious cuts, we see that the 
probability that a spurious cut is included or symmetric cut is missed in the final set is 
zero. 

At the end of this step, we are left with a set containing only asymmetric cuts 

{f{h l )J{h 2 ),...J{h k )}. 

At this point, we simply proceed with the phase 2 mutatis mutandis and claim results 
similar to the ones derived earlier. 

4.5 Summary 

Consider an ordered restriction map with k + 2m restriction sites, of which m are sym- 
metric cuts. Assume that the postulated experiment observes these maps, with each 
observation suffering from partial digestion error (p < 1), misorientation error, spurious 
cuts (determined by a Poisson process with parameter Aj), but no sizing error. 

Theorem 4.1 Let e be a positive constant and c > ln(5/e). Then for 
"2(ln(fe + m) + c) 21n£; + 8c lnm + c" 



n > max 



' 2 ' 2 

p p L JT 



there is a probability of at least (1 — e) that the correct ordered restriction map can be 
computed in 0(n(L + k 2 + m)) time. 
When 

ln(k + m) 

n < 



p(l +p) 

(0 < p < 0.69), there is a probability greater than half that no algorithm can compute the 
correct ordered restriction map. □ 

5 Discretization 

Next, we consider the effect of discretizing the map data by dividing it uniformly into 
several intervals of equal sizes. The main argument in favor of discretizing the data has 
been to accommodate the sizing errors that make the cut locations deviate from their 
true location. The main source for the sizing error has been the nonuniform attachment 
of the flurochromes that are necessary to visualize the DNA. We will study the effect of 
sizing error on discretization in a later section. 

Let us assume that the clone DNA that we wish to analyze is of length L bps. Let A 
represent a small subinterval and 5 = A/L. Thus the unit length is partitioned into M = 
1/5 = L/A consecutive subintervals. One assumes that it is not possible to distinguish 
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5 DISCRETIZATION 10 



the restriction cuts and spurious cuts in each of these subintervals. Thus, we need to 
ensure that 5 is significantly small so that no more than one true restriction cut location 
belongs to a subinterval. We now write r = Xj5 = AXj/L to denote the probability that 
we shall observe one spurious cut in a subinterval. Note that the probability that we shall 
observe / spurious cuts in any observation is given by 

M\ i \ f / ,m f , X f X f A 

t fry 1 - r) M ~ f , where r = -*= = -*=r- . 
f J ML 

Thus in the limit as M — > oo and r — > 0, 

Jirn^ (Xf/M)f(l - X f /M) M -f = e~ A A 

the analysis given earlier holds true. Furthermore, if we are simply interested in the 
effect of finite M (and nonzero r), we are still able to prove that for realistic values of 
r < p/27 the earlier bounds still hold. Thus, it suffices to ensure that Xf/M < p/27, or 
M > 27Xf /p — for instance, M could be 270 and satisfy the inequality as long as Xf < 1.0 
and p > 0.1. (See appendix A2.) 

Typical values for various clones may be as follows: for lambdas, M can range from 
200 to 2,000 and r xi 10" 3 -10~ 4 ; for cosmids, M is 2,000-4,000 and r x 10" 4 ; for BACs 
M ~ 15,000 and r ~ 10~ 4 . In general, even for significantly smaller (but still realistic) 
values of M, r <C p. 

5.1 Limit on M 

It is worth noting that the discretization process makes it possible for spurious cuts to 
introduce a "wrong" cut site into final map. For instance, if each of the n observations 
contains a spurious cut in the same subinterval, then no algorithm can distinguish this 
spurious cut from a true cut (independent of digestion rate). Thus the probability that 
none of the M subintervals has a spurious cut in each of the n observations is given by 

„n\M 



[1 - r n y 

ln(l/r) ' 



Now if we assume that n < ■ ^frf \ , then the above probability is bounded from above by 



(l-r n ) M < 


(1- 


r hxM/ \n(\/r)\M 




(- 




< 






~ M J 




1 


1 


< 


- < 






e 


2' 



Hence we must further guarantee that 
ln(L/A) ln(M) 



n > 



ln(l/r) ln(M/A/) 

since otherwise there is a probability of half or more that the computed map will be 
wrong. 
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6 Sizing Errors 

Next, suppose we model the sizing error and analyze its effect. Before doing so, we 
need to derive some inequalities relating the size of the discretized subinterval (A) to 
several other external parameters. In particular, in order to infer the map correctly with 

probability greater than 1 / \/2, we must guarantee that A < ( — ] , where k is 

\(k-l)p E J 

the number of cuts and pe denotes the probability that the restriction enzyme cuts at a 
site. 

( 2 \ 

Assume that A > - — . Let I denote the length of the smallest restriction 

\(k-l)p E J 

fragment (piece of the molecule between two consecutive restriction sites). Note that the 
fragment lengths are distributed as pe&~ PeX , and the probability that a fragment is of 



length > A/3 is 

/•oo 

/ p E e~ PEX dx = e~ PEA / 3 . 

JA/3 



I A/3 

Thus the probability that the smallest of all (k — 1) fragments is no smaller than A/3, is 

e -(fc-l) PE A/3 < e -2/3^ 

Thus the probability that the smallest fragment is of length < A/3 and that both ends 
of the fragment belong to the same subinterval is bounded by 

l-e" 2/3 ) (1-1/3) >l-i=. 

However, note that for a BAC clone, this implies that the largest value we may choose 
for A < 200bp (requiring M to be about 750). 

Next assume that a true cut site at location h actually appears as a Gaussian distribu- 
tion ~ N(h, a). Again, considering the complementary requirement to the one mentioned 
earlier, we must ensure that the observed cuts corresponding to the same true cut (at 
location hi) belong to the same subinterval with high probability. As a result, we may 
require that 

Vi<i<fc 3i<i<M K £ (jA + a, (J + 1)A - a), 

with high probability (say, > 1/V2). Thus, we require that 

2a 



e -2ka/A > J_ 

" y/2' 



In other words, we require that 2ka/A < In 2/2, and 
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7 EXPERIMENTAL VERIFICATION 12 

A simple calculation for the BAC example reveals that in order to guarantee the above 
inequality we need that a < 0.89bp. Thus for all practical purposes, in order for the dis- 
cretized algorithm to work with any degree of correctness, we must require the observation 
to be free of sizing error. As a result, one can explain why several algorithms devised to 
work with dicretization failed, while purely continuous versions (or some combination) 
have done well. 

7 Experimental Verification 

This section compares the performance of a program based on the maximum likelihood 
approach to map-computation (described in [AMS97]) with the theoretical bounds in the 
previous sections. At the time of this writing, AMS algorithm [AMS97] still remains the 
only algorithm that has worked successfully on raw experimental data, without access to 
any extraneous parameters or the final answer. In each case, when the computed map was 
verified with data (from sequence and gel data) derived independently and subsequent to 
the experiment, the algorithm was found to be remarkably successful. 

For all the experiments described in this section, random data were generated using 
the data models of the previous sections. For each data model and assumed number of 
data molecules, we generated 20 random data samples and counted the fraction of these 
samples for which the maximum likelihood program computed the correct map. For each 
data model the number of data molecules is varied to obtain the fraction of cases solved 
correctly as a function of the number of data molecules. We show that in each case there 
is a fairly sharp transition from not being able to solve any of the 20 samples to being able 
to solve all 20 samples. Moreover this transition point lies within the theoretical bounds 
computed in the previous sections. Finally we examine the performance of the maximum 
likelihood program for the case where there is significant sizing error. In this case the 
discrete methods described previously fail to work altogether, whereas the maximum 
likelihood method continues to work, albeit requiring a larger number of data molecules 
as the sizing error increases. 

The maximum likelihood approach described in [AMS97] is based on a continuous 
(non-discrete) modeling of the data. The modeling of sizing error in the model results 
in a singularity in the probability density when the sizing error is zero. Therefore this 
case was approximated by assuming a small sizing error of 10~ 5 of the total molecule 
size, a = 1.5bp. Each data model is specified by providing the number (k) and value of 
the actual cut locations, the sizing error in the form of a standard deviation (a), a digest 
rate (p) and a false cut rate (A/). For each model, random data is generated with the 
help of a random number generator in a straight forward fashion: For each of the actual 
cuts, we draw a random number uniformly from [0,1], and if this value is below p, the 
cut is assumed to be present. We then draw another random number from the standard 
Gaussian distribution to determine the location of the cut, thus modeling the effect of 
sizing error. Next, false cuts are added by first drawing a random sample from a Poisson 
distribution with mean Xf to determine the number of false cuts, and then drawing the 
required number of random samples uniformly over [0,1] to get the false cut locations. 
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Figure 1: Experimental Results: #Cuts, k = 37, a = 1.5bp, p = 0.1 

This step results in the generation of one in-silico "molecule." This process is repeated 
to get the required number of molecules to make up one data set. This data set is then 
input as raw data to our maximum likelihood program, and the resulting map is scored a 
success if the number of cuts is correct, and the location of each cut is within one standard 
deviation (<r) of the correct location. (Note that a is the standard deviation for the cuts 
of one sample molecule: the map computed by the AMS algorithm typically has a sizing 
error much less than that since the data from all molecules are averaged). This process 
is repeated for a total of 20 samples and the fraction of times the program succeeds is 
recorded against the data sample parameters (k, a, p, Xf, number of molecules). The 
whole process (i.e., the one generating 20 samples) was repeated for different values of 
the parameters. The number of cuts was varied using the values k = 0, 1, 2, 5, 10, 20 
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Figure 2: Experimental Results: #Cuts, k = 37, a = 1.5bp, p = 0.2 

and 37. The values of p tested were p=0.W and p=0.20. The values of A/ tested were 
A/=0 (no false cuts) and Xf=l,2 and 4. For most experiments we selected a = 1.5bp to 
approximate no sizing error, but for a small number of experiments with k = 2 and 37 we 
also tested a= 150bp, 300bp, 750bp and 1.5Kb. Most experiments were repeated with 
the number of molecules set at 10, 20, 30, 40, 50, 70, 100, 200, 500 and 1000, and in a 
few instances 2000 or 5000. 

The results are summarized in a series of graphs showing the success rate (out of 20 
samples) as a function of the number of molecules used. The graph in Figure 1 shows 
the case for k = 37 and Xf = 0, 1, 2 and 4, which corresponds to the case analyzed 
in Section 4. We see that for p = 0.10 and = 1 a sharp transition occurs when 
the number of molecules increases from 30 to 50. At 70 or more molecules the AMS 



Downloaded from http://biorxiv.org/on September 18, 2014 



7 EXPERIMENTAL VERIFICATION 15 



'(♦cuts -37, SD -150bp) Prob (#cuts -37, SD -300bp) 




Figure 3: Experimental Results: #Cuts, k = 37, Xf = 1, p = 0.1 

algorithm never (out of 20 experiments) fails to find the correct map, whereas for 20 or 
less molecules it invariably fails to find the correct map. For p = 0.20 (Figure 2), the 
transition (from probability of near 0 to near 1) occurs at a lower value of around 20-30 
molecules. Compare this with the theoretical bounds on the number of molecules required 
from section 4 of between 30 and 100 (lower bound and upper bound respectively). 

When the number of (true) cuts in the molecules is changed to k = 20, 10, 5 and 
2, similar graphs are obtained: Figures 4 and 5 show the results for the case k = 20; 
Figures 6 and 7, for the case k = 10; Figures 8 and 9, for the case k = 5; Figures 10 and 
11, for the case k = 2; Figures 13 and 14, for the case k = 1 and Figure 15, for k = 0. 
The main trend is an increase in the number of molecules required as k is reduced down 
to k = 2: for instance, with k = 2 and p = 0.1, 500 molecules are required to find the 



Downloaded from http://biorxiv.org/on September 18, 2014 



7 EXPERIMENTAL VERIFICATION 16 



Prot i#cuts -20, LambdaF -0) Prol )#cuts -20, LambdaF -1) 




Figure 4: Experimental Results: #Cuts, k = 20, a = 1.5bp, p = 0.1 

correct map in every case (A/ = 0, 1, 2 and 4), in contrast to just 200 for k = 37. This 
observation agrees with the theory from sections 4 and 5 which shows that the bounds 
increase slowly as k is decreased. However, the case k = 1, Figures 13 and 14, show that 
fewer molecules are required: e.g., with p = 0.1 and Xt = 1, 200 molecules are sufficient 
to find the correct map. The reason is that orientation is less of a problem with only 1 
cut. 

Figure 3 shows what happens with k = 37 when the sizing error is increased to 
150bp, 300bp, 750bp and 1.5Kb, respectively. With p = 0.10 and Xf = 1, the number of 
molecules required to find the correct map in every case increases from 200 to about 5000 
as the sizing error increases. Figure 12 shows what happens at k = 2 when sizing error 
is increased similarly. In this case the number of molecules increases from an already 
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Figure 5: Experimental Results: #Cuts, k = 20, a = 1.5bp, p = 0.2 

larger value, but more slowly: it increases from 500 to 2000. While we do not have any 
theoretical bounds for this case, the intuition is that while it is harder to get the correct 
orientation with k = 2 than with k = 37, it is less likely that neighboring cuts will be 
confused with each other due to sizing errors when k = 2 than when k = 37. 

8 Discussion: Genomic Mapping 

The strategies for genome-wide genotype or haplotype mapping using single molecule 
optical maps are similar in spirit to the approaches for clone mapping, described in this 
paper; but there are also several differences in the details of the implementation as new 
and powerful heuristics need to be incorporated in order to tame the computational 
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Figure 6: Experimental Results: #Cuts, k = 10, a = 1.5bp, p = 0.1 

complexity of searching over the hypotheses space. The details of the algorithm can be 
found elsewhere [AnMS99]. We summarize below a 0-1 law applicable to the experiment 
design in this genomic-mapping setting; the derivation of the following result is in [AM01]. 

Consider an optical mapping experiment for genome-wide shotgun mapping for a 
genome of size G and involving M molecules each of length L^. Thus the coverage is 
MLd/G. Let the a fragment of true size X have a measured size ~ J\f(X,a 2 X). Let 
the average true fragment size be L, and the digestion rate of the restriction enzyme 
be p. Thus the average relative sizing error R = o\JpjL and the average size of aligned 
fragments will be L/p 2 . As usual, let 6 represent the minimum "overlap threshold." Hence 
the expected number of aligned fragments in a valid overlap is at least n = OL^p 2 / L. Let 
d = l/p, the inverse of the digest rate. Feasible experimental parameters are those that 
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Figure 7: Experimental Results: #Cuts, k = 10, a = 1.5bp, p = 0.2 



result in an acceptable (e.g. < 10 ) False Positive rate FPT: 



FPT 



2AP 



[2nd + 2] \ (Ryff ) n 2(d ^L nR 
[2n(d - 1)J J yjmt & 



To achieve acceptable false positive rate, one needs to choose an acceptable value 
for the experimental parameters: p, a, and coverage. FPT exhibits a sharp phase 
transition in the space of experimental parameters. Thus the success of a mapping project 
depends extremely critically on a prudent combination of experimental errors (digestion 
rate, sizing), sample size (molecule length and number of molecules) and problem size 
(genome length). Relative sizing error can be lowered simply by increasing L with a 
choice of rarer-cutting enzyme and digestion rate can be improved by better chemistry. 



Downloaded from http://biorxiv.org/on September 18, 2014 



8 DISCUSSION: GENOMIC MAPPING 20 



'(♦cuts -5, LarabdaF -0) Prob (#cuts -5, LarabdaF -1) 




Figure 8: Experimental Results: #Cuts, k = 5, a = 1.5bp, p = 0.1 

As an example, for a human genome of size G = 3, 300M6 and a desired coverage 
of 6x, consider the following experiment. Assume a typical value of molecule length 
Ld = 2Mb. If the enzyme of choice is PAC I, the average true fragment length is about 
25Kb. Assume a minimum overlap 1 of 9 = 30%. Assume that the sizing error for a 
fragment of 30kb is about S.Okb, and hence a 2 = 0.3kb. With a digest rate of p = 82% 
we get an unacceptable FPT ~ 0.0362. However just increasing p to 86% results in an 
acceptable FPT ~ 0.0009. Alternately, reducing average sizing error from 3.0kb to 2Akb 
while keeping p = 82% also produces an acceptable FPT w 0.0007. 
Acknowledgment. Our thanks go to Naomi Silver, Rohit Parikh, Raghu Varadhan, 
Joel Spencer, Alan Frieze, Sylvain Cappel, Bruce Donald, Mike Wigler and Laxmi Parida 



lr This value should be selected to minimize FPT. 
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Figure 9: Experimental Results: #Cuts, k = 5, a = 1.5bp, p = 0.2 
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Figure 12: Experimental Results: #Cuts, k = 2, At = 1, p = 0.1 
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Figure 14: Experimental Results: #Cuts, k = 1, a = 1.5bp, p = 0.2 
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Al. Bound for section 4.2 



Assume the same notation as in section §4.2: We can provide a somewhat better bound 
when p is small, i.e., p ~ 1/k. 

Let a be a function of p and k: 



a 



l-{l-p) k - kp(l - p)*- 1 > 1 - (1 + p(k - l))e" p(A; - 1) , 
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Figure 15: Experimental Results: #Cuts, k = 0 



and let 

n > 



*) in 
2a J 



k 



k — Ink 



where k — In k > c. 



Construct a random subgraph Gr = (V, En) of G as follows: For any given observa- 
tion with two or more cuts choose one edge at random from all the possible edges that 
the observation contributes to G. Discard those observations with fewer than two cuts. 
Thus with every observation, when we add an edge we do so uniformly randomly and 
independent of all the other edges chosen in Gr. Note that the probability that an ob- 
servation has two or more cuts is a and the probability that an edge is added to Gr in 
an observation is 2a/k(k — 1) > 2a/k 2 . 
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Table 1: Summary of Experimental Results. Number of molecules necessary as functions 
of the parameters: #Cuts, k £ {0..37}, Digest rate p £ {0.1,0.2} and X f £ {0, 1,2,4}. 



For any pair [/(/ij), f(hj)], the probability that this edge does not occur in Gr is less 
than 

k 2 J ~ k ' 

and thus the "edge-probability," p e (the probability that this edge occurs in Gr) is 
Ink + c N 



Pe > 



k 



Thus by the well-known result on the connectivity in random graphs [Spe87] , we see that 

(\i\k + c 

with p e > — 

V k 



k— >oo 

Note that 



lim Pr [Gk, Pe is connected] = e 



a > min 



4 p 2 (k-l) 2 
5' 



16 



and if k 3> c then 
k 



In 



< (l + o(l)) 



In k + c 



k — In k — cj V k 

Thus it suffices for our purpose to choose 



n > max 



" +< < 1 >y* + .), g+ f>' 11 * +c 

p z k 
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Theorem 8.1 Let e be a positive constant and c > ln(2/e). Then for 
In k + c 



n > max 



i±^), (ln , + c ,,(^)^ 

a probability at least 1 — e, i/ie correct ordered restriction map can be computed in 
0(nk 2 ) time. □ 

Furthermore, we conclude that, for p > 1/k, n = 0{k\ogk) observations suffice to 
find the true map without any other prior knowledge of p. 

A2. Discretization 

As before, let us assume that the clone DNA is of length L bps. Let A represent a small 
subinterval and 5 = A/L. Thus the unit length is partitioned into M = 1/5 = L/A 
consecutive subintervals. We write r = Xfd = AXf/L to denote the probability that we 
shall observe one spurious cut in a subinterval. 

Typical values for various clones may be as follows: for lambdas, M can range from 
200 to 2,000 and r 10" 3 -10" 4 ; for cosmids, M is 2,000-4,000 and r ps 10~ 4 ; for BACs 
M ~ 15,000 and r ~ 10 -4 . In general, even for significantly smaller (but still realistic) 
values of M, r <C p. 

Bounds 

We write p = p + r — pr to denote the probability that a subinterval contains a true or 
spurious cut site. We will use the following simplifying assumption: 

27r < p. 

More precisely, p/6r > 2e — 1. 

We summarize the bounds as follows: 

Theorem 8.2 Let e be a positive constant and c > ln(5/e). Then for 

9 I" . . 2(lnA; + c) lnm + c 

n > — max ln(k + m) + c, , , 

PI p p 

(ln(L/2A - k - m) + c) 

(L > 2 A and r < p/27), the probability that the correct ordered restriction map can be 
computed in 0(n(L + k 2 + m)) time is at least 1 — e. □ 

We will now introduce two parameters ei = p/6r and eo, and guarantee that e\ > 2e— 1 
and eo > 1/2. Furthermore, we have 

(1 + ei) < p/4r. 
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Phase 1 a 

In phase 1 a, our goal is to construct the set 

{f(hi),f(h 2 ),...J(h k+m )}, 
by considering the observation-based sets 

{/Oil), f(si2), f(siii)}, i = l,...,n. 

and including only those /(sy)'s that occur in significantly large numbers of times, deter- 
mined by a threshold Th\. Suppose that a location f(h) corresponds to a true location, 
then the number of /(sjj)'s equal to f(h) must follow a Binomial distribution ~ S(n,p), 
if it is an asymmetric cut and ~ S(n,2p), if it is a symmetric cut. If on the other hand, 
f(h) does not correspond to any true location, then the number of /(sy)'s equal to this 
f(h) must follow a Binomial distribution ~ S(n,2r). 
If we set the threshold at 

Thi = (1 + ei)2nr < f 2- ) 2nr < — . 

\4ry 2 

then T/i! = (1 — eo)wp, where eo > |. 

By assumption (statement of the theorem above), 



n > — max [\n(k + m) + c, ln(M/2 — k — m) + cl 
Thus (using the Chernoff bound [ASE92]) 



Pr 



S(n,p) < (1 - e 0 )np 



< e -(^/ 2 )™P 

< e-^/ 8 < ' 



k + m 



Thus the probability that all the correct cuts appear in the computed set is bounded 
from below by e~ e 

Again, using the Chernoff bound [ASE92] in the other direction, we get 



Pr 



S{n,2r) > (l + ei)2nr 



<- 2 _ (l+ei)2nr ^ 2 _ (p/6r)2nr 
< e -((ln2)/3)np 



< e 



-np/8 



< 



M/2 - k-m 

Thus the probability that no spurious cut appears in the computed set > e~ e 
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Phase 1 b 

In phase 1 b, our goal is to construct the set of asymmetric cuts 

{f{h 1 )J(h 2 ),...J{h k )} : 

by eliminating the symmetric cuts. Suppose that a location f(h) corresponds to a sym- 
metric true cut site, then the number of times an observation has sites at s' = f(h) and 
s" = f{h) R must follow a Binomial distribution ~ S(n,p 2 ). If on the other hand, f(h) is 
not a symmetric site, then the corresponding number must follow a Binomial distribution 
~ S(n,pr). 

If we set the threshold at 

( v\ n P 2 
Tri2 = (1 + eijnpr < — npr < ——. 

\4r/ 2 

then Tfi2 = (1 — eo)np 2 , where eo > \- 

By assumption (statement of the theorem above), 



n > — max 
pi 



8(lnm + c), ( — - ) (In A; + c) 
In 2 



Thus (using the Chernoff bound [ASE92]) 



Pr 



S(n,p 2 ) < (1 - e 0 )np 2 



< e 



-(eg/2)np2 



< e" 



< 



Thus the probability that all the symmetric cuts are correctly classified is bounded 
from below by e~ e 

Again, using the Chernoff bound [ASE92] in the other direction, we get 

Pr S(n,pr) > (1 + e\)npr 

((In2)/6)nj3 2 e 



< e" 



< 



A; 



Thus the probability that no symmetric cut is misclassified is bounded from below by 



Phase 2 

The proof proceeds in a manner similar to the one given for the non-discretized case. In 
phase 2, our goal is to assign consistent sign labels to the asymmetric cuts 



{f(hi),f(h 2 ),...J(h k )}, 
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so that the final map can be constructed correctly with high probability. 

Let Sftj denote the set of observations containing a cut site matching f(hi), and Si 
denote the set containing a true cut site matching f(hi). Note that {S^l > \S\ | and 
\Sfr | follows a Binomial distribution ~ S(n,p). Using the Chernoff bound, we have 



Pr 



9 

S(n, p) < -(Ink + c) 
P 



< e 



-np/8 



< e 



Let n\ be the number cut sites matching h\. Thus 

8(v + r) 8 
ni > -af ^(lnfc + c) = —Qnk + c), 

with a probability > 1 — e _c . Here (3 + = (p 2 + r 2 )/(p + r). 

Consider a potential edge [f(hi),f(hi)]. Let denote the number of times two cut 
sites matching f(hx) and f(hi), respectively, appear in the same half [either in (0, 1/2) 
or in (1/2, 1)] in an observation in S^. If the correct edge labeling is +1 then nj has a 
Binomial distribution ~ S(ni,(3 + ), where /?+ = (p 2 + r 2 )/(p + r). If on the other hand, 
the correct edge labeling is —1 then rij has a Binomial distribution ~ S(ni,f3-), where 
/9_ = (2pr)/(p + r)). 

Set the threshold at 



Th* 



(i + ei)m/3_ < ( |: ) "i 



< 



2(p + r) 



< 



Thus r/13 = (1 — eo)rti/3+, where eo > \- 

Thus (using the Chernoff bound [ASE92]) 



Pr 



S(m,p+) < (l-e 0 )ni/3+ 



< e -(4/2) ni f3+ < e -nif3 + /8 



< 



k 



Again, using the Chernoff bound [ASE92] in the other direction, we get 



Pr 



5(n l5 /3_) > (l + ei)m/3_ 



< 2 



< e 



-(l+ei)ni/?_ 



< 2 



-((ln2)/3)m/3+ 



< 



-(l+p/6r)ni/3_ 



A' 



Thus it follows that the probability that all the edges receive the correct edge labeling 
is > (1 — e~ c )e~ e c . This concludes the proof. 



